
Poster: aronsson Date: Jan 30, 2005 9:11pm
Forum: toronto Subject: Re: Universal OCR

I looked around the IA text collections for books pertaining to Scandinavia that I can reuse in Project Runeberg, and immediately found three, now available at http://runeberg.org/pictswed/, http://runeberg.org/ivar/, and http://runeberg.org/utveck/

Of these, the first one (Pictures of Sweden, by Hans Christian Andersen) has already been through proofreading at Distributed Proofreaders and is available from Project Gutenberg in TXT and HTML. However, for Project Runeberg I need to know where the page breaks and line breaks are, and this information is lost in the PG e-text, so now we are publishing the raw OCR and letting our volunteers proofread it anew. This is a waste of effort that I wish I knew how to avoid. Further, both PG/DP and Project Runeberg lose the pixel coordinates of each word, which are available in the DjVu format.

One way out of this would be to improve the proofreading processes of DP and Project Runeberg so that no information is lost. Another way might be to rebuild the information after it is lost. Perhaps something like the GNU wdiff (word difference) utility could be used to see which words have been moved, joined or changed during proofreading, and to tie this back to the pixel coordinates in the original DjVu file. Has anybody tried this?
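
A rough sketch of what I have in mind, in Python, with difflib's SequenceMatcher standing in for GNU wdiff (the two word lists are taken from the example below; this is only an illustration, not a finished tool):

# Word-diff sketch: find which words survived proofreading unchanged
# and which were joined, split or corrected during proofreading.
from difflib import SequenceMatcher

ocr_words     = "I is a delightf-ul spring, the birds warble,".split()
proofed_words = "It is a delightful spring: the birds warble,".split()

for op, i1, i2, j1, j2 in SequenceMatcher(None, ocr_words, proofed_words).get_opcodes():
    print(op, ocr_words[i1:i2], "->", proofed_words[j1:j2])

# 'equal' spans ("is a", "the birds warble,") are the anchors;
# 'replace' spans pair each corrected word with the OCR words it replaced.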

For example, the first line of raw OCR text of page 9 of the DjVu file UF00001842 reads:


I
is
a
delightf-ul
spring,
the
birds
warble,


and the proofed plain text at Project Gutenberg reads:

It is a delightful spring: the birds warble,

so the words "is" and "warble" match unchanged. Is this enough for designing a utility that maps the coordinates of "delightf-ul" to the corrected word "delightful"? Would this be useful?


Poster: aronsson Date: Jan 30, 2005 9:35pm
Forum: toronto Subject: Re: Universal OCR

Sorry, the formatting was lost there. What I wanted to say was this:

For example, the first line of raw OCR text of page 9 of the DjVu file UF00001842 reads:

<LINE>
<WORD coords="382,2455,466,2381">I</WORD>
<WORD coords="511,2455,568,2380">is</WORD>
<WORD coords="618,2455,660,2408">a</WORD>
<WORD coords="705,2481,1077,2377">delightf-ul</WORD>
<WORD coords="1132,2482,1418,2379">spring,</WORD>
<WORD coords="1485,2456,1606,2382">the</WORD>
<WORD coords="1652,2458,1848,2380">birds</WORD>
<WORD coords="1901,2471,2171,2380">warble,</WORD>
</LINE>

and the proofed plain text at Project Gutenberg reads:

It is a delightful spring: the birds warble,

so the words "is", "birds", and "warble" match unchanged. Is this enough for designing a utility that maps the coordinates of "delightf-ul" to the corrected word "delightful"? Would this be useful?
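
To make that concrete, here is a rough Python sketch of such a utility: it parses the LINE element above, aligns the OCR words with the proofed text using difflib (again standing in for wdiff), and carries each bounding box over to the corrected word. I am assuming the coords attribute is xmin,ymax,xmax,ymin in page pixel coordinates; that would have to be checked against the real djvuxml output.

# Sketch: carry DjVu word coordinates over to the proofread text.
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

# The LINE element quoted above, as real XML.
line_xml = """<LINE>
<WORD coords="382,2455,466,2381">I</WORD>
<WORD coords="511,2455,568,2380">is</WORD>
<WORD coords="618,2455,660,2408">a</WORD>
<WORD coords="705,2481,1077,2377">delightf-ul</WORD>
<WORD coords="1132,2482,1418,2379">spring,</WORD>
<WORD coords="1485,2456,1606,2382">the</WORD>
<WORD coords="1652,2458,1848,2380">birds</WORD>
<WORD coords="1901,2471,2171,2380">warble,</WORD>
</LINE>"""

# (word, (xmin, ymax, xmax, ymin)) pairs; the coordinate order is an assumption.
ocr = [(w.text, tuple(int(c) for c in w.get("coords").split(",")))
       for w in ET.fromstring(line_xml).iter("WORD")]
proofed = "It is a delightful spring: the birds warble,".split()

def union(boxes):
    # Smallest box covering all the given boxes.
    return (min(b[0] for b in boxes), max(b[1] for b in boxes),
            max(b[2] for b in boxes), min(b[3] for b in boxes))

ocr_words = [w for w, _ in ocr]
mapped = []  # (proofed word, coords or None)
for op, i1, i2, j1, j2 in SequenceMatcher(None, ocr_words, proofed).get_opcodes():
    if op == "equal" or (op == "replace" and i2 - i1 == j2 - j1):
        # Unchanged words, and one-for-one corrections, keep their own box.
        mapped += [(proofed[j1 + k], ocr[i1 + k][1]) for k in range(j2 - j1)]
    elif op in ("replace", "insert"):
        # Joined or added words get the union of the OCR boxes they replace.
        boxes = [b for _, b in ocr[i1:i2]]
        mapped += [(proofed[j], union(boxes) if boxes else None) for j in range(j1, j2)]
    # op == "delete": OCR words dropped in proofreading simply lose their box.

for word, box in mapped:
    print(word, box)
# "delightful" inherits the box of "delightf-ul", "It" that of "I", and so on.

Words rejoined across a hyphenated line break would fall into the union-of-boxes case, which seems good enough for highlighting purposes; whether the alignment holds up over whole pages (moved footnotes, reflowed paragraphs) is what would need experimenting with.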


Poster: brewster (Administrator, Curator, or Staff) Date: Jan 30, 2005 10:54pm
Forum: toronto Subject: Re: Universal OCR

If there were a tool to go from

djvuxml -> distributed proofreaders -> djvuxml

and preserve as many bounding boxes as possible (some of that will be difficult or impossible, so it is not that important that it preserve them all), then we would have a set of images-of-words and unicode-words -- or you can think of it as a training set for OCR.
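
For instance, once a proofread word and its surviving bounding box are tied together, something along these lines could cut the training pairs out of the page scans. This is only a sketch using Pillow; the file names, and the assumption that coords are xmin,ymax,xmax,ymin with y growing downward, are mine rather than confirmed details of the djvuxml format.

# Sketch: cut word images out of a page scan and pair them with their
# proofread Unicode text, giving (image, text) training examples for OCR.
from PIL import Image

def cut_training_pairs(page_image_path, words, out_prefix="word"):
    page = Image.open(page_image_path)
    pairs = []
    for n, (text, (xmin, ymax, xmax, ymin)) in enumerate(words):
        crop = page.crop((xmin, ymin, xmax, ymax))  # PIL box: left, upper, right, lower
        name = f"{out_prefix}{n:05d}.png"
        crop.save(name)
        pairs.append((name, text))  # image file plus its Unicode transcription
    return pairs

# Hypothetical usage:
# cut_training_pairs("UF00001842_p009.png",
#                    [("delightful", (705, 2481, 1077, 2377))])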

We have gotten interest from the Machine Learning folks in making a universal OCR engine out of this.

What would be particularly interesting is non-Roman scripts, so we may need to construct the djvuxml more from scratch.

If anyone is interested in this, please let us know by forum post, email, or phoning the archive.

-brewster


Poster: Branko Collin Date: Feb 22, 2005 8:24am
Forum: toronto Subject: Re: Universal OCR

"However, for Project Runeberg I need to know where the page breaks and line breaks are, and this information is lost in the PG e-text"

DP now tries to retain at least page numbers in its HTML versions (though they are unlikely to appear at the exact page boundaries all the time, because we reconnect words that were broken across page boundaries). Also, footnotes, columns and other items that span pages are unlikely to be in the right position, so to speak.

In other words, when sending a text through DP, it is not unreasonable to ask our volunteers to retain page breaks.

"Is this enough for designing a utility that maps the coordinates of "delightf-ul" to the corrected word "delightful"?"

I don't see why not.

"Would this be useful?"

I think it would be.


Poster: Branko Collin Date: Feb 22, 2005 8:34am
Forum: toronto Subject: Re: Universal OCR

BTW, you could use DP just for proofreading. During proofreading rounds, we retain line breaks to make it easier for our volunteers to compare the text with the scan. Line breaks are only removed during the post-processing round.